Abstract

ROC curves and cost curves are two popular ways of visualising classifier performance, finding
appropriate thresholds according to the operating condition, and deriving useful aggregated measures
such as the area under the ROC curve (AUC) or the area under the optimal cost curve. In this note we
present some new findings and connections between ROC space and cost space, by using the expected loss over
a range of operating conditions. In particular, we show that ROC curves can be transferred to cost space 
by means of a very natural way of understanding how thresholds should be chosen, by selecting the 
threshold such that the proportion of positive predictions equals the operating condition (either in the 
form of cost proportion or skew). We call these new curves ROC cost curves, and we demonstrate that 
the expected loss as measured by the area under these curves is linearly related to AUC. This opens up 
a series of new possibilities and clarifies the notion of cost curve and its relation to ROC analysis. In 
addition, we show that for a classifier that assigns the scores in an evenly-spaced way, these curves are 
equal to the Brier curves. As a result, this establishes the first clear connection between AUC and the 
Brier score. 

Keywords: cost curves, ROC curves, Brier curves, classifier performance measures, cost-sensitive
evaluation, operating condition, Brier score, Area Under the ROC Curve (AUC).



1 Introduction 

There are many graphical representations and tools for classifier evaluation, such as ROC curves ifnU Tl. 
ROC isometrics lUl, cost curves HHH, DET curves lfT3l . lift charts lITSll . calibration maps ||3l, among others. 
In this paper, we will focus on ROC curves and cost curves. These are often considered to be two sides of 
the same coin, where a point in ROC space corresponds to a line in cost space. However, this is only true 
up to a point, as a curve in ROC space has no corresponding representation in cost space. It is true that the 
convex hull of a ROC curve corresponds to the lower envelope of all the cost lines, but this is not the ROC 
curve. In fact, the area under this lower envelope has no clear connection with AUC. As a result, cost space
cannot be used in the same way as ROC space, and each representation has some advantages over the
other.

One of the issues with this lack of full correspondence is that the definition of what a cost curve is has
been rather vague in the literature. On some occasions, only the cost lines are formally defined [5], and
the curve is just defined as the lower envelope of all these lines. However, this assumes that threshold
choices are optimal, which is not generally the case. This curve is what we call here 'the optimal cost curve'
(frequently referred to in the literature as 'the cost curve'). It is worth mentioning that Drummond and
Holte [5] talk about 'selection criteria' (instead of 'threshold choice methods'): they distinguish between
'performance-independent selection criteria' and 'cost-minimizing selection criteria', and they show some
curves using different 'selection criteria'. However, they do not develop the ideas further and they do not
use this to generalise the notion of cost curve.

In previous work, we have generalised and systematically developed the concept of threshold choice
method. For instance, in [10] we explored a new instance-uniform threshold choice method, while in
[12] we explored the probabilistic threshold method. In [8] we analyse this in general, leading to a total of
six threshold choice methods and their corresponding measures.

In this paper, we are interested in how all this can be plotted in cost space and, in particular, we analyse
a new threshold choice method which assigns the threshold such that the proportion or rate of positive
predictions equals the operating condition (cost proportion). This leads to a cost curve where all the segments
have equal length in terms of their projection onto the x-axis. In other words, each segment covers a range of
cost proportions of equal length.

A first graphical analysis of this curve indicates that each segment corresponds to a point in ROC space, 
and its position with respect to the optimal cost curve gives virtually the same information as the ROC curve 
does. Consequently, we call this new curve the ROC cost curve. It can also be interpreted as a cost-based 
analysis of rankers. Further analysis of this curve shows that the area under the ROC cost curve is a linear 
function of AUC, so doubly justifying the name given to this curve, its interpretation and its applications. 

The paper is organised as follows. Section 2 introduces some basic notation and definitions. In Section
3 we refer to the relation between the ROC convex hull and the optimal cost curve. Section 4 introduces one
of the contributions in the paper by using a threshold choice method which leads to the ROC cost curves. It
also explains how these curves can be plotted easily and what their interpretation is. Section 5 shows that
the area under this curve is a linear function of AUC, and demonstrates the correspondence for some typical
cases (random classifier, perfect classifier, worst classifier). Section 6 analyses when a classifier chooses its
scores in an evenly-spaced way. In this case, it turns out that the area under the ROC cost curve is exactly
the Brier score. Section 7 closes the paper with some conclusions and future work.

2 Notation and basic definitions 

In this section we introduce some basic notation and the notions of ROC curve, cost curves and the way 
expected loss is aggregated using a threshold-choice method. Most of this section is reused from 13 . 



2.1 Notation 

We denote by $U_S(x)$ the continuous uniform distribution of variable $x$ over an interval $S \subset \mathbb{R}$. If this
interval $S$ is $[0, 1]$ then $S$ can be omitted.

Examples (also called instances) are taken from an instance space. The instance space is denoted $X$ and
the output space $Y$. Elements in $X$ and $Y$ will be referred to as $x$ and $y$ respectively. For this paper we will
assume binary classifiers, i.e., $Y = \{0, 1\}$. A crisp or categorical classifier is a function that maps examples
to classes. A probabilistic classifier is a function $m : X \to [0,1]$ that maps examples to estimates $\hat{p}(1|x)$ of the
probability of example $x$ to be of class 1. A scoring classifier is a function $m : X \to \mathbb{R}$ that maps examples
to real numbers on an unspecified scale, such that scores are monotonically related to $\hat{p}(1|x)$. In order to
make predictions in the $Y$ domain, a probabilistic or scoring classifier can be converted to a crisp classifier
by fixing a decision threshold $t$ on the scores. Given a predicted score $s = m(x)$, the instance $x$ is classified
in class 1 if $s > t$, and in class 0 otherwise.

For a given, unspecified classifier and population from which data are drawn, we denote the score density
for class $k$ by $f_k$ and the cumulative distribution function by $F_k$. Thus, $F_0(t) = \int_{-\infty}^{t} f_0(s)\,ds = P(s < t \mid 0)$ is the
proportion of class 0 points correctly classified if the decision threshold is $t$, which is the sensitivity or true
positive rate at $t$. Similarly, $F_1(t) = \int_{-\infty}^{t} f_1(s)\,ds = P(s < t \mid 1)$ is the proportion of class 1 points incorrectly
classified as 0, or the false positive rate at threshold $t$; $1 - F_1(t)$ is the true negative rate or specificity.^1

Given a data set $D \subset \langle X, Y \rangle$ of size $n = |D|$, we denote by $D_k$ the subset of examples in class $k \in \{0, 1\}$,
and set $n_k = |D_k|$ and $\pi_k = n_k/n$. We will use the term class proportion for $\pi_0$ (other terms such as 'class
ratio' or 'class prior' have been used in the literature). The average score of class $k$ is $\bar{s}_k = \frac{1}{n_k}\sum_{\langle x,y \rangle \in D_k} m(x)$.
Given any strict order for a data set of $n$ examples we will use the index $i$ on that order to refer to the i-th
example. Thus, $s_i$ denotes the score of the i-th example and $y_i$ its true class. We use $I$ to denote the set of
indices, i.e. $I = 1..n$. Given a data set and a classifier, we can define empirical score distributions for which
we will use the same symbols as the population functions. We then have $f_k(s) = \frac{1}{n_k}|\{\langle x,y \rangle \in D_k \mid m(x) = s\}|$,
which is non-zero only in $n'_k$ points, where $n'_k \le n_k$ is the number of unique scores assigned to instances
in $D_k$ (when there are no ties, we have $n'_k = n_k$). Furthermore, the cumulative distribution functions
$F_k(t) = \sum_{s < t} f_k(s)$ are piecewise constant with $n'_k + 1$ segments.

$F_0$ is called sensitivity and $1 - F_1$ is called specificity. The meaning of $F_0(t)$ can be seen as the proportion
of examples of class 0 which are correctly classified if the threshold is set at $t$. Conversely, the meaning of
$1 - F_1(t)$ can be seen as the proportion of examples of class 1 which are correctly classified if the threshold
is set at $t$.

2.2 Operating conditions and overall loss 

When a classification model is applied, the conditions or context might be different to those used during
its training. In fact, a classifier can be used in several contexts, with different results. A context can
imply different class proportions, different costs over examples (either for the attributes, for the class or any
other kind of cost), or some other details about the effects that the application of a model might entail and
the severity of its errors. In practice, an operating condition or deployment context is usually defined by
a misclassification cost function and a class distribution. Clearly, there is a difference between operating
when the cost of misclassifying 0 into 1 is equal to the cost of misclassifying 1 into 0 and doing so when the
former is ten times the latter. Similarly, operating when classes are balanced is different from when there is
an overwhelming majority of instances of one class.

One general approach to cost-sensitive learning assumes that the cost does not depend on the example
but only on its class. In this way, misclassification costs are usually simplified by means of cost matrices,
where we can express that some misclassification costs are higher than others [6]. Typically, the costs of
correct classifications are assumed to be 0.^2 This means that for binary classifiers we can describe the cost
matrix by two values $c_k > 0$, representing the misclassification cost of an example of class $k$. Additionally,
we can normalise the costs by setting $b = c_0 + c_1$ and $c = c_0/b$; we will refer to $c$ as the cost proportion.
Since this can also be expressed as $c = (1 + c_1/c_0)^{-1}$, it is often called 'cost ratio' even though, technically,
it is a proportion ranging between 0 and 1. We can see the dependency between $b$ and $c_0$, which leaves just
one degree of freedom, and we can set one of them constant. Consequently, choosing $b$ constant we see that
it only affects the magnitude of the costs but is independent of the classifier. We set $b = 2$ so that loss is
commensurate with error rate (which just assumes $c_0 = c_1 = 1$).

^1 We use 0 for the positive class and 1 for the negative class, but scores increase with $\hat{p}(1|x)$. That is, a ranking from strongest
positive prediction to strongest negative prediction has non-decreasing scores. This is the same convention as used by, e.g., [11].

The loss which is produced at a decision threshold $t$ and a cost proportion $c$ is then given by the formula:

$$Q_c(t;c) \triangleq c_0\pi_0(1 - F_0(t)) + c_1\pi_1 F_1(t) = 2\{c\pi_0(1 - F_0(t)) + (1 - c)\pi_1 F_1(t)\} \tag{1}$$

We are often interested in analysing the influence of class proportion and cost proportion at the same time.
Since the relevance of $c_0$ increases with $\pi_0$, an appropriate way to consider both at the same time is by the
definition of skew, which is a normalisation of their product:

$$z \triangleq \frac{c_0\pi_0}{c_0\pi_0 + c_1\pi_1} = \frac{c\pi_0}{c\pi_0 + (1-c)(1-\pi_0)} \tag{2}$$

It follows that $c = \frac{z\pi_1}{z\pi_1 + (1-z)(1-\pi_1)}$. From Eq. (1) we obtain

$$\frac{Q_c(t;c)}{c_0\pi_0 + c_1\pi_1} = z(1 - F_0(t)) + (1 - z)F_1(t) \triangleq Q_z(t;z) \tag{3}$$



This gives an expression for loss at a threshold $t$ and a skew $z$. We will assume that the operating condition
is either defined by the cost proportion (using a fixed class distribution) or by the skew. We then have the
following simple but useful result.

Lemma 1. If $\pi_0 = \pi_1$ then $z = c$ and $Q_z(t;z) = \frac{2}{b} Q_c(t;c)$.

Proof. If classes are balanced we have $c_0\pi_0 + c_1\pi_1 = b/2$, and the result follows from Eq. (2) and Eq. (3).

□

This further justifies taking $b = 2$, which means that $Q_z$ and $Q_c$ are expressed on the same 0-1 scale, and,
as said above, are also commensurate with error rate, which assumes $c_0 = c_1 = 1$. The upshot of Lemma 1 is
that we can transfer any expression for loss in terms of cost proportion to an equivalent expression in terms
of skew by just setting $\pi_0 = \pi_1 = 1/2$ and $z = c$.
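The definitions of cost proportion, skew and the two losses can be checked numerically. The following is a minimal sketch, assuming $b = 2$ and that $F_0(t)$ and $F_1(t)$ are already known for the chosen threshold; the function names are illustrative and not part of the paper.

```python
def cost_proportion(c0, c1):
    """Cost proportion c = c0 / b with b = c0 + c1."""
    return c0 / (c0 + c1)


def skew(c, pi0):
    """Skew z of Eq. (2), written in terms of c and the class proportion pi0."""
    pi1 = 1.0 - pi0
    return c * pi0 / (c * pi0 + (1.0 - c) * pi1)


def loss_c(F0, F1, c, pi0, b=2.0):
    """Loss Q_c(t; c) of Eq. (1) for a threshold with rates F0 = F_0(t), F1 = F_1(t)."""
    pi1 = 1.0 - pi0
    return b * (c * pi0 * (1.0 - F0) + (1.0 - c) * pi1 * F1)


def loss_z(F0, F1, z):
    """Loss Q_z(t; z) of Eq. (3)."""
    return z * (1.0 - F0) + (1.0 - z) * F1


# Balanced classes: Lemma 1 says z = c and, with b = 2, Q_z = Q_c.
c = cost_proportion(c0=1.5, c1=0.5)
z_balanced = skew(c, pi0=0.5)
# Imbalanced classes: the skew no longer coincides with c.
z_imbalanced = skew(c, pi0=0.2)
```

With balanced classes the two losses coincide for any pair of rates, which is the content of Lemma 1; with $\pi_0 = 0.2$ the skew moves away from $c$.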

In many real problems, when we have to evaluate or compare classifiers, we do not know the cost 
proportion or skew that will apply during application time. One general approach is to evaluate the classifier 
on a range of possible operating points. In order to do this, we have to set a weight or distribution on cost 
proportions or skews. In this paper, we will consider the continuous uniform distribution U. 

A key issue when applying a classifier to several operating conditions is how the threshold is chosen
in each of them. If we work with a crisp classifier, this question vanishes, since the threshold is already
settled. However, in the general case when we work with a soft probabilistic classifier, we have to decide
how to establish the threshold. The crucial idea explored in this paper is the notion of threshold choice
method, a function $T(c)$ or $T(z)$, which converts an operating condition (cost proportion or skew) into an
appropriate threshold for the classifier. There are several reasonable options for the function $T$. We can set
a fixed threshold for all operating conditions, we can set the threshold by looking at the ROC curve (or its
convex hull) and using the cost proportion or the skew to intersect the ROC curve (as ROC analysis does),
we can set a threshold looking at the estimated scores, especially when they represent probabilities, or we
can set a threshold independently from the rank or the scores. The way in which we set the threshold may
dramatically affect performance. But, no less importantly, the performance measure used for evaluation
must be in accordance with the threshold choice method.

^2 Not doing so, or just considering one of the correct classifications to have cost 0, will lead to results which are different to the
simplified setting by a constant term or factor, as happens with the model for cost-loss ratio used by Murphy in [14].

^3 Hand [11, p. 115] assumes $b$ and $c$ to be independent, and hence considers $b$ not necessarily a constant. However, in the end, he
also assumes that the result is only affected by a constant factor.

From this interpretation, Adams and Hand [1] suggest setting a distribution over the set of possible
operating points and integrating over them. In this way, we can define the overall or average expected loss in
a range of situations as follows:

$$L_c \triangleq \int_0^1 Q_c(T_c(c); c)\, w_c(c)\, dc \tag{4}$$

where $Q_c(t;c)$ is the expected cost for threshold $t$ as seen above, $T_c$ is a threshold choice method, which
maps cost proportions to thresholds, and $w_c(c)$ is a distribution for costs in $[0,1]$. Clearly, any
performance measure which attempts to measure average expected cost over a wide range of operating
conditions depends on two things: first, the distribution $w_c(c)$ that we use to weight the range of conditions;
second, the threshold choice method $T_c$. Additionally, we can define this overall or average expected cost
to be independent of the class priors, by defining a similar construction for skews instead of costs:

$$L_z \triangleq \int_0^1 Q_z(T_z(z); z)\, w_z(z)\, dz \tag{5}$$

If we draw $Q_c$ or $Q_z$ over $c$ and $z$ respectively, we get a plot space known as cost plots or curves, as we
will illustrate below. Cost curves are also known as risk curves (see, e.g., [16], where the plot can also be
shown in terms of priors, i.e. class proportions).

So a cost curve as a function of $z$ in our notation is simply:

$$CC_z(z) \triangleq Q_z(T(z); z) \tag{6}$$

and similarly for cost proportions. Note that it is the threshold choice method $T$ which can draw a different
curve for the same classifier.
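As a quick check of Eq. (4), consider the simplest threshold choice method, a fixed threshold $T_c(c) = t$ for every operating condition, together with the uniform distribution $w_c = U$. The worked integration below uses only Eq. (1) and is offered as an illustration rather than as a result stated in this section; it shows that the expected loss then reduces to the error rate at $t$.

```latex
\begin{align*}
L_c &= \int_0^1 2\{c\,\pi_0(1 - F_0(t)) + (1 - c)\,\pi_1 F_1(t)\}\,dc \\
    &= 2\pi_0(1 - F_0(t)) \int_0^1 c\,dc \;+\; 2\pi_1 F_1(t) \int_0^1 (1 - c)\,dc \\
    &= \pi_0(1 - F_0(t)) + \pi_1 F_1(t).
\end{align*}
```

The last line is the proportion of misclassified examples at threshold $t$, which is why setting $b = 2$ makes the loss commensurate with error rate.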

2.3 Some common plots and measures 

In what follows, we introduce some common evaluation measures: the Brier score, the ROC space and
the Area Under the ROC curve (AUC). In the following section we also introduce the convex hull and the
optimal cost curves.

The Brier score is a well-known evaluation measure for probabilistic classifiers. It is an alternative name
for the Mean Squared Error or MSE loss [2], especially for binary classification. $BS(m,D)$ is the Brier
score of classifier $m$ with data $D$; we will usually omit $m$ and $D$ when clear from the context. We define
$BS_k(m,D) = BS(m,D_k)$. $BS$ is defined as follows:

$$BS \triangleq \frac{1}{n}\sum_{i=1}^{n}(s_i - y_i)^2 = \pi_0 BS_0 + \pi_1 BS_1 \tag{7}$$

where $s_i$ is the score predicted for example $i$ and $y_i$ is the true class for example $i$. The corresponding
population quantities are $BS_0 = \int_0^1 s^2 f_0(s)\,ds$ and $BS_1 = \int_0^1 (1-s)^2 f_1(s)\,ds$.
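The decomposition in Eq. (7) can be verified on a small data set. This is a minimal sketch using the seven scores and classes listed in the caption of Figure 2; the helper names are hypothetical.

```python
def brier_score(scores, labels):
    """Mean squared difference between score and true class, Eq. (7)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def brier_by_class(scores, labels):
    """Return {k: (pi_k, BS_k)} for k in {0, 1}."""
    parts = {}
    for k in (0, 1):
        class_scores = [s for s, y in zip(scores, labels) if y == k]
        pi_k = len(class_scores) / len(scores)
        bs_k = sum((s - k) ** 2 for s in class_scores) / len(class_scores)
        parts[k] = (pi_k, bs_k)
    return parts


# Scores and classes from the caption of Figure 2.
scores = [0.95, 0.9, 0.8, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 1, 0, 0, 0]

bs = brier_score(scores, labels)
recombined = sum(p * b for p, b in brier_by_class(scores, labels).values())
```

Here `bs` and `recombined` agree up to floating-point error, as the right-hand side of Eq. (7) requires.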

The ROC curve [?] is defined as a plot of $F_1(t)$ (i.e., false positive rate at decision threshold $t$) on the
x-axis against $F_0(t)$ (true positive rate at $t$) on the y-axis, with both quantities monotonically non-decreasing
with increasing $t$ (remember that scores increase with $\hat{p}(1|x)$ and 1 stands for the negative class). Figure 1
(leftmost, dashed lines) shows a ROC curve for a classifier with 4 examples of class 1 and 11 examples of
class 0. Because of ties, there are 11 distinct scores and hence 11 bins/segments in the ROC curve.
From a ROC curve, we can derive the Area Under the ROC curve (AUC) as:

$$\begin{aligned}
AUC &\triangleq \int_0^1 F_0(s)\,dF_1(s) = \int_{-\infty}^{+\infty} F_0(s) f_1(s)\,ds = \int_{-\infty}^{+\infty}\int_{-\infty}^{s} f_0(t) f_1(s)\,dt\,ds \\
&= \int_0^1 (1 - F_1(s))\,dF_0(s) = \int_{-\infty}^{+\infty} (1 - F_1(s)) f_0(s)\,ds = \int_{-\infty}^{+\infty}\int_{s}^{+\infty} f_1(t) f_0(s)\,dt\,ds
\end{aligned} \tag{8}$$

When dealing with empirical distributions the integral is replaced by a sum.
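For empirical distributions, the double integral in Eq. (8) becomes a count over pairs made of one class 0 example and one class 1 example. The sketch below assumes that tied pairs count one half, which is a common convention but is not stated in the text, and uses the data from the caption of Figure 2.

```python
def auc(scores, labels):
    """Empirical AUC: fraction of (class 0, class 1) pairs that are correctly ordered.

    Class 0 is the positive class and should receive the lower scores.
    Tied pairs count 0.5 (an assumption of this sketch).
    """
    pos = [s for s, y in zip(scores, labels) if y == 0]
    neg = [s for s, y in zip(scores, labels) if y == 1]
    correct = 0.0
    for s0 in pos:
        for s1 in neg:
            if s0 < s1:
                correct += 1.0
            elif s0 == s1:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Scores and classes from the caption of Figure 2.
scores = [0.95, 0.9, 0.8, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 1, 0, 0, 0]
auc_fig2 = auc(scores, labels)
```

For this data set 10 of the 12 pairs are correctly ordered, so `auc_fig2` is 10/12.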

3 The optimal cost curve 

Given a scoring (or soft) classifier, one approach for choosing a classification threshold is to consider that
(1) we have complete information about the operating condition (class proportions and costs) and (2)
we are able to use that information to choose the threshold that will minimise the cost using the current
classifier. ROC analysis is precisely based on these two points and, as we have seen, using the skew and the
convex hull, we can calculate the threshold which gives the smallest loss (for the training set).
This threshold choice method, denoted by $T_c^o$, is:

$$T_c^o(c) \triangleq \arg\min_t \{Q_c(t;c)\} = \arg\min_t\, 2\{c\pi_0(1 - F_0(t)) + (1 - c)\pi_1 F_1(t)\} \tag{9}$$

which matches the optimal threshold for a given skew $z$:

$$T_z^o(z) \triangleq \arg\min_t \{Q_z(t;z)\} = T_c^o(c) \tag{10}$$

This threshold gives the convex hull in the ROC space. The convex hull of a ROC curve (ROCCH) is
a construction over the ROC curve in such a way that all the points on the ROCCH have minimum loss
for some choice of $c$ or $z$. This means that we restrict attention to the optimal threshold for a given cost
proportion $c$. Note that the argmin will typically give a range (interval) of values which give the same
optimal value. The convex hull is defined by the points $(F_1(t), F_0(t))$ where $t = T_c^o(c)$ for some $c$. Then,
in order to make a hull, all the remaining points are linearly interpolated (pairwise). All this is shown in
Figure 1 (leftmost). The Area Under the ROCCH (denoted by AUCH) can be computed in a similar way as
the AUC with modified versions of $f_k$ and $F_k$. Obviously, $AUCH \ge AUC$, with equality implying the ROC
curve is convex.

A cost plot as defined by [5] has $Q_z(t;z)$ on the y-axis against skew $z$ on the x-axis (Drummond and
Holte use the term 'probability cost' rather than skew). Since $Q_z(t;z) = z(1 - F_0(t)) + (1 - z)F_1(t)$, cost
lines for a given decision threshold $t$ are straight lines $Q_z = a_0 + a_1 z$ with intercept $a_0 = F_1(t)$ and slope
$a_1 = 1 - F_0(t) - F_1(t)$. A cost line visualises how cost at that threshold changes between $F_1(t)$ for $z = 0$
and $1 - F_0(t)$ for $z = 1$.

From the set of all cost lines, we can choose line segments, and by piecewise connecting them we have
a 'hybrid cost curve' [5]. One way of choosing these segments is by considering the optimal threshold.
Hence, the optimal or minimum cost curve is the lower envelope of all the cost lines, obtained by only
considering the optimal threshold (the lowest cost line) for each skew. The cost curve for this optimal choice
is just given by instantiating Eq. (6) with the optimal threshold choice method. Namely, for skews, we
would have:

$$CC_z^o(z) \triangleq Q_z(T_z^o(z); z) \tag{11}$$
Figure 1: Several graphical representations for the classifier with probability estimates (0.95, 0.90, 0.90,
0.85, 0.70, 0.70, 0.70, 0.55, 0.45, 0.20, 0.20, 0.18, 0.16, 0.15, 0.05) and classes (1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0). Left: ROC curve (solid) and convex hull (dashed). Middle: cost lines and optimal cost curve
against cost proportions. Right: cost lines and optimal cost curve against skews.



Following the classifier and the ROC curve shown in Figure 1 (leftmost), we also show the optimal cost
curve (rightmost) for that classifier. We observe 7 segments in the original ROC curve on the left, and 5
segments in its convex hull. We see that these 5 segments correspond to the 5 points in the optimal cost
curve on the right. The optimal cost curve is 'constructed' as the lower envelope of the 12 cost lines (one
more than the number of distinct scores).

The middle plot in Figure 1 is an alternative cost plot with cost proportion rather than skew on the
x-axis. That is, here the cost lines are straight lines $Q_c = a'_0 + a'_1 c$ with intercept $a'_0 = 2\pi_1 F_1(t)$ and slope
$a'_1 = 2\pi_0(1 - F_0(t)) - 2\pi_1 F_1(t)$. We can clearly observe the class imbalance.

For the classifier shown in Figure 1, if we are given an extreme skew $z = 0.8$, we know that any threshold
between 0.90 and 0.95 will be optimal, since it will classify example 15 as negative (1) and the rest as
positive (0). This cutpoint (e.g. $t = 0.92 \in [0.90, 0.95]$) gives $F_0 = 11/11$ and $F_1 = 3/4$, and minimises the
loss for this skew, as given by Eq. (3), i.e. $Q_z(0.92; 0.8) = 0.8 \cdot (1 - 11/11) + (1 - 0.8) \cdot (3/4) = 0.15$.
Another cutpoint, e.g. $t = 0.85$, gives $F_0 = 10/11$ and $F_1 = 2/4$, with a higher $Q_z(0.85; 0.8) = 0.8 \cdot (1 -
10/11) + (1 - 0.8) \cdot (2/4) = 0.17$.
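The two loss values in this example can be reproduced from the data in the caption of Figure 1. The sketch below assumes that a score equal to the threshold is predicted as class 0, in line with the rule that an instance is assigned class 1 only if $s > t$; this is the convention that yields $F_0 = 10/11$ at $t = 0.85$. The function names are illustrative.

```python
# Scores and classes from the caption of Figure 1.
SCORES = [0.95, 0.90, 0.90, 0.85, 0.70, 0.70, 0.70, 0.55,
          0.45, 0.20, 0.20, 0.18, 0.16, 0.15, 0.05]
LABELS = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]


def rates(scores, labels, t):
    """Return (F0, F1): the fraction of each class predicted as class 0 at threshold t.

    Scores equal to t are predicted as class 0 (assumption of this sketch).
    """
    predicted_0 = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for s, y in zip(scores, labels):
        total[y] += 1
        if s <= t:
            predicted_0[y] += 1
    return predicted_0[0] / total[0], predicted_0[1] / total[1]


def loss_z(F0, F1, z):
    """Loss Q_z(t; z) of Eq. (3)."""
    return z * (1.0 - F0) + (1.0 - z) * F1


loss_092 = loss_z(*rates(SCORES, LABELS, 0.92), z=0.8)
loss_085 = loss_z(*rates(SCORES, LABELS, 0.85), z=0.8)
```

Under these assumptions `loss_092` is 0.15 and `loss_085` is about 0.173, which the text rounds to 0.17.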

We may be interested in calculating the area under this optimal cost curve. If we use skews, we can
derive:

$$L_z^o \triangleq \int_0^1 Q_z(T_z^o(z); z)\, w_z(z)\, dz \tag{12}$$

But this equation is exactly the TEC (from 'Total Expected Cost') given by Drummond and Holte ([5], page
106, bottom). Drummond and Holte use the term 'probability times cost' for skew (or simply, and somewhat
misleadingly, 'probability cost'). The distribution of probability costs is denoted by $prob(x)$ ($w_z(z)$ in our
notation). For $prob(x)$, Drummond and Holte choose the uniform distribution, i.e.:

$$L_{U(z)}^o \triangleq \int_0^1 Q_z(T_z^o(z); z)\, U(z)\, dz \tag{13}$$

This expression is just the area under the cost curve. In Drummond and Holte's words: "The area under
a cost curve is the expected cost of the classifier assuming all possible probability cost values are equally
likely, i.e. that prob(x) is the uniform distribution."
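The area in Eq. (13) can be approximated by evaluating the lower envelope of the cost lines on a grid of skews. This is a rough numerical sketch, not the authors' procedure: the grid size, the trapezoidal rule and the function names are all assumptions, and tied scores are merged so that only real cut points produce cost lines.

```python
def cost_lines(scores, labels):
    """Return (F0, F1) for every distinct cut point, from 'all class 1' to 'all class 0'."""
    order = sorted(zip(scores, labels))
    n0 = sum(1 for y in labels if y == 0)
    n1 = len(labels) - n0
    lines = [(0.0, 0.0)]
    below0 = below1 = 0
    for i, (s, y) in enumerate(order):
        if y == 0:
            below0 += 1
        else:
            below1 += 1
        # Only cut after the last example of a group of tied scores.
        if i == len(order) - 1 or order[i + 1][0] != s:
            lines.append((below0 / n0, below1 / n1))
    return lines


def optimal_cost_area(scores, labels, grid=1000):
    """Trapezoidal approximation of the area under the optimal cost curve over skews."""
    lines = cost_lines(scores, labels)
    envelope = []
    for i in range(grid + 1):
        z = i / grid
        envelope.append(min(z * (1.0 - F0) + (1.0 - z) * F1 for F0, F1 in lines))
    return sum((envelope[i] + envelope[i + 1]) / 2.0 for i in range(grid)) / grid


# Two tiny balanced rankers: class 0 should get the lowest scores.
area_perfect = optimal_cost_area([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
area_worst = optimal_cost_area([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
```

For the perfect ranker one cost line is identically zero, so the area is 0; for the worst ranker the envelope is $\min(z, 1 - z)$ and the area is 1/4.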

The problem with all this is that we are not always given all the information about the operating condition.
In fact, even having that information, there are perfect techniques (namely ROC analysis) to get the optimal
threshold for a data set (e.g. the training or validation data set), but this does not ensure that these choices
are going to be optimal for a test set. Consequently, evaluating classifiers in this way relies on a strong assumption.
Additionally, how close the estimated optimal threshold is to the actual optimal threshold may depend on
the classifier as well. One option is to consider confidence bands, but another option is just to drop this
assumption.

4 The ROC cost curve 

The easiest way to choose the threshold is to set it independently from the classifier and also from the 
operating condition. This mechanism can set the threshold in an absolute or a relative way. The absolute 
way, as explored in [8], just sets $T(c) = t$ (or, for skews, $T(z) = t$), with $t$ being a fixed threshold. A simple
variant of the fixed threshold is to consider that it is not the absolute value of the threshold which is fixed, 
but a relative rate or proportion r over the data set. In other words, this method tries to quantify the number 
of positive examples given by the threshold. For example, we could say that our threshold is fixed to predict 
30% positives and the rest negatives. This of course involves ranking the examples by their scores and 
setting a threshold at the appropriate position. We will develop this idea for cost proportions below. 

4.1 The ROC cost curve for cost proportions 

The definition of the rate-fixed threshold choice method for costs is as follows:

$$T_c^q[r](c) \triangleq \{t : P(s_i < t) = r\} \tag{14}$$

In other words, we choose the threshold such that the probability that a score is lower than the threshold - 
i.e., the positive prediction rate, is r. In the example in FigurefTl any value in the interval [0.3, 0.2] makes that 
the probability (or proportion) of the score being lower than that value is 2/6 = 0.33, which approximates 
r = 0.3. 



It is interesting to connect the previous expression of this threshold given by Eq. (14) with the cumulative
distributions.

Lemma 2.

$$T_c^q[r](c) = \{t : F_0(t)\pi_0 + F_1(t)\pi_1 = r\} \tag{15}$$

Proof. We can rewrite:

$$P(s < t) = P(s < t \mid 0)P(0) + P(s < t \mid 1)P(1)$$

But using the definition of $P(s < t \mid 0)$ and $P(s < t \mid 1)$ in the preliminaries in terms of the cumulative
distributions, we have:

$$P(s < t) = F_0(t)P(0) + F_1(t)P(1) = F_0(t)\pi_0 + F_1(t)\pi_1$$

so substituting into Eq. (14) we have the result:

$$T_c^q[r](c) = \{t : F_0(t)\pi_0 + F_1(t)\pi_1 = r\}$$

□

This straightforward result shows that this criterion clearly depends on the classifier, but it only takes the 
ranks into account, not the magnitudes of the scores. 
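Because the rate-fixed method only uses ranks, it can be sketched as a cut in the sorted list of scores. The helper below is hypothetical: it assumes no tied scores, rounds $r \cdot n$ to the nearest integer, and places the threshold at the midpoint between two neighbouring scores, none of which is prescribed by the text.

```python
def rate_fixed_threshold(scores, r):
    """Return a threshold such that roughly a fraction r of the scores lie below it."""
    ordered = sorted(scores)
    n = len(ordered)
    k = int(r * n + 0.5)          # number of positive (class 0) predictions
    k = max(0, min(n, k))
    if k == 0:
        return ordered[0] - 1.0   # any value below every score
    if k == n:
        return ordered[-1] + 1.0  # any value above every score
    return (ordered[k - 1] + ordered[k]) / 2.0


def positive_rate(scores, t):
    """Proportion of scores strictly below t, i.e. P(s < t)."""
    return sum(1 for s in scores if s < t) / len(scores)


# Scores from the caption of Figure 2, with a target rate of 30% positives.
scores = [0.95, 0.9, 0.8, 0.3, 0.2, 0.1, 0.05]
t = rate_fixed_threshold(scores, 0.3)
achieved = positive_rate(scores, t)
```

With seven examples the requested rate of 0.3 cannot be met exactly; this sketch settles for two positive predictions, so `achieved` is 2/7.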

However, there is a natural way of setting the positive prediction rate in an adaptive way. Instead of
fixing the proportion of positive predictions, we may take the operating condition into account. If we have
an operating condition, we can use the information about the skew or cost proportion to adjust the positive
prediction rate to that proportion. This leads to the rate-driven threshold selection method: if we are given
cost proportion $c$, we choose the threshold $t$ in such a way that we get a proportion of $c$ positive predictions.

Figure 2: Several graphical representations for the classifier with probability estimates (0.95, 0.9, 0.8, 0.3, 0.2,
0.1, 0.05) and classes (1, 0, 1, 1, 0, 0, 0). Left: ROC curve (solid) and convex hull (dashed). Right: cost lines,
optimal cost curve (dashed) and ROC cost curve (thick and solid) against cost proportions.



$$T_c^n(c) \triangleq T_c^q[c](c) = \{t : P(s_i < t) = c\} \tag{16}$$

And given this threshold selection method, we can now derive its cost curve:

$$CC_c^n(c) \triangleq Q_c(T_c^n(c); c) \tag{17}$$

Because of Lemma 2 we can see that this is equivalent to:

$$CC_c^n(c) = Q_c(\{t : F_0(t)\pi_0 + F_1(t)\pi_1 = c\}; c) \tag{18}$$



Assuming no ties, we see that the expression $F_0(t)\pi_0 + F_1(t)\pi_1$ only changes its value between scores. If
we have $n$ examples, it only changes $n+1$ times. So for finite populations, this has to be rewritten as follows:

$$CC_c^n(c) = Q_c\left(\left\{t : c - \frac{1}{n+1} < F_0(t)\pi_0 + F_1(t)\pi_1 < c\right\}; c\right) \tag{19}$$

This leads to $n+1$ intervals in cost space, and the threshold does not change within each of these intervals.
This means that the cost line is the same throughout an interval. This leads to the following procedure:

ROC cost curve for cost proportions: $CC_c^n$

Given a classifier and a data set with $n$ examples:

1. Draw the $n+1$ cost lines, $CL_0$ to $CL_n$.

2. From left to right, draw the curve following each cost line (from $CL_0$ to $CL_n$) with a width on the
x-axis of $\frac{1}{n+1}$.
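The two steps of this procedure can be sketched in a few lines. The code below is an illustrative reading of the procedure, assuming no tied scores and taking $CL_k$ to be the cost line of the cut that predicts the $k$ lowest-scoring examples as class 0; the function names and the segment representation are assumptions.

```python
def roc_cost_curve(scores, labels):
    """Return the n + 1 segments of the ROC cost curve for cost proportions.

    Each segment is (c_left, c_right, intercept, slope), where the cost line is
    Q_c = intercept + slope * c on the interval [c_left, c_right].
    """
    n = len(scores)
    n0 = labels.count(0)
    n1 = n - n0
    pi0, pi1 = n0 / n, n1 / n
    order = sorted(zip(scores, labels))
    below0 = below1 = 0
    segments = []
    for k in range(n + 1):
        F0, F1 = below0 / n0, below1 / n1
        intercept = 2.0 * pi1 * F1
        slope = 2.0 * pi0 * (1.0 - F0) - 2.0 * pi1 * F1
        segments.append((k / (n + 1), (k + 1) / (n + 1), intercept, slope))
        if k < n:  # move the cut one example up the ranking
            if order[k][1] == 0:
                below0 += 1
            else:
                below1 += 1
    return segments


def area_under(segments):
    """Exact integral of the piecewise linear curve described by the segments."""
    return sum(a * (hi - lo) + b * (hi ** 2 - lo ** 2) / 2.0
               for lo, hi, a, b in segments)


# Scores and classes from the caption of Figure 2.
scores = [0.95, 0.9, 0.8, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 1, 0, 0, 0]
segments = roc_cost_curve(scores, labels)
area_fig2 = area_under(segments)
```

For the classifier of Figure 2 this yields 8 segments of width 1/8 and an area of 152/896, about 0.1696 under these assumptions.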

Figure 2 shows a small classifier for a data set with 3 positive examples and 4 negative examples. The
ROC curve on the left has 8 points, since there are 8 cut points to choose the threshold, leading to 8 crisp 
classifiers and, accordingly, 8 cost lines. These cost lines are shown in the cost space in the plot on the 
right. We see that the projection of each segment onto the x-axis has exactly a length of 1/8. Note that each 
segment uses a portion of each cost line. 



It is relatively easy to understand what these curves mean and to see their correspondence to ROC curves. 
Following Figure 2, going from (0,0) to (1,1) in the ROC curve, the first three points are sub-optimal. The
fourth point is a good point, because this point is going to be chosen for many slopes. The fifth and the sixth 
are bad points, since they are under the convex hull, and they will never be chosen. The seventh is a good 
point again. The eighth is a bad point. This is exactly what the ROC cost curve shows. Only the fourth 
and seventh segments are optimal and match the optimal cost curve. So, the ROC cost curve has a segment 
intersecting with the optimal cost curve for every point on the convex hull. All other segments correspond 
to sub-optimal decision thresholds. 

5 The area under the ROC cost curve 



If we plug the rate-driven threshold choice method $T_c^n$ (Eq. (16)) into the general formula of the average
expected cost for a range of cost proportions (Eq. (4)) we have:

$$L_c^n \triangleq \int_0^1 Q_c(T_c^n(c); c)\, w_c(c)\, dc \tag{20}$$

Using the uniform distribution, this expected loss equals the area under the ROC cost curve. It can be linked
to the area under the ROC curve as follows.



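Proposition 3.

$$L_{U(c)}^n = 2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0$$

Proof.

$$\begin{aligned}
L_{U(c)}^n &= \int_0^1 Q_c(T_c^n(c); c)\, U(c)\, dc \\
&= \int_0^1 2\{c\pi_0(1 - F_0(T_c^n(c))) + (1 - c)\pi_1 F_1(T_c^n(c))\}\, dc \\
&= \int_0^1 2\{c\pi_0 - c[\pi_0 F_0(T_c^n(c)) + \pi_1 F_1(T_c^n(c))]\}\, dc + \int_0^1 2\{\pi_1 F_1(T_c^n(c))\}\, dc
\end{aligned}$$

From Lemma 2 we have that:

$$T_c^q[r](c) = \{t : F_0(t)\pi_0 + F_1(t)\pi_1 = r\}$$

and of course

$$T_c^n(c) = \{t : F_0(t)\pi_0 + F_1(t)\pi_1 = c\}$$

Since this is the $t$ which makes the expression equal to $c$, we can find that expression and substitute it by $c$.
Then we have:

$$\begin{aligned}
L_{U(c)}^n &= \int_0^1 2\{c\pi_0 - c \cdot c\}\, dc + \int_0^1 2\{\pi_1 F_1(T_c^n(c))\}\, dc \\
&= \left[c^2\pi_0 - \frac{2c^3}{3}\right]_0^1 + \int_0^1 2\{\pi_1 F_1(T_c^n(c))\}\, dc \\
&= \pi_0 - \frac{2}{3} + 2\pi_1 \int_0^1 F_1(T_c^n(c))\, dc
\end{aligned}$$

We have to solve the term $\int_0^1 F_1(T_c^n(c))\, dc$. In order to do this, we observe that using $T_c^n(c)$ and
integrating over $dc$ is like using the mixture distribution for thresholds $t$ and integrating over $dt$, since
$c = \pi_0 F_0(t) + \pi_1 F_1(t)$ implies $dc = (\pi_0 f_0(t) + \pi_1 f_1(t))\, dt$.

$$\begin{aligned}
\int_0^1 F_1(T_c^n(c))\, dc &= \int_{-\infty}^{\infty} F_1(t)\{\pi_0 f_0(t) + \pi_1 f_1(t)\}\, dt \\
&= \pi_0 \int_{-\infty}^{\infty} F_1(t) f_0(t)\, dt + \pi_1 \int_{-\infty}^{\infty} F_1(t) f_1(t)\, dt \\
&= -\pi_0 \int_{-\infty}^{\infty} (-1 + 1 - F_1(t)) f_0(t)\, dt + \pi_1 \int_{-\infty}^{\infty} F_1(t)\, dF_1(t) \\
&= -\pi_0 \int_{-\infty}^{\infty} -f_0(t)\, dt - \pi_0 \int_{-\infty}^{\infty} (1 - F_1(t)) f_0(t)\, dt + \pi_1 \int_{-\infty}^{\infty} F_1(t)\, dF_1(t) \\
&= \pi_0 - \pi_0 AUC + \frac{\pi_1}{2} \\
&= \pi_0(1 - AUC) + \frac{\pi_1}{2}
\end{aligned}$$

And now we can plug this into the expression for the expected cost:

$$\begin{aligned}
L_{U(c)}^n &= \pi_0 - \frac{2}{3} + 2\pi_1\left(\pi_0(1 - AUC) + \frac{\pi_1}{2}\right) \\
&= \pi_0 - \frac{2}{3} + 2\pi_1\pi_0(1 - AUC) + \pi_1\pi_1 \\
&= 2\pi_1\pi_0(1 - AUC) + \pi_1(1 - \pi_0) + \pi_0 - \frac{2}{3} \\
&= 2\pi_1\pi_0(1 - AUC) + 1 - \pi_1\pi_0 - \frac{2}{3} \\
&= 2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0
\end{aligned}$$

□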



This shows that not only does this new curve have a clear correspondence to ROC curves, but its area is
linearly related to AUC.

From costs to skews we have by Lemma 1:

Corollary 4.

$$L_{U(z)}^n = \frac{1 - AUC}{2} + \frac{1}{12}$$

Thus, expected loss is 1/3 for a random classifier, 1/3 - 1/4 = 1/12 for a perfect classifier and 1/3 + 1/4 =
7/12 for the worst possible classifier.
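The closed form of Proposition 3 and the three special cases of Corollary 4 are easy to check with exact rational arithmetic. This is a minimal sketch; the function name is illustrative.

```python
from fractions import Fraction


def expected_loss_rate_driven(auc, pi0):
    """Closed form of Proposition 3: 2*pi1*pi0*(1 - AUC) + 1/3 - pi1*pi0."""
    pi1 = 1 - pi0
    return 2 * pi1 * pi0 * (1 - auc) + Fraction(1, 3) - pi1 * pi0


half = Fraction(1, 2)
loss_random = expected_loss_rate_driven(half, half)            # AUC = 1/2
loss_perfect = expected_loss_rate_driven(Fraction(1), half)    # AUC = 1
loss_worst = expected_loss_rate_driven(Fraction(0), half)      # AUC = 0

# The classifier of Figure 2: AUC = 10/12 and pi0 = 4/7.
loss_fig2 = expected_loss_rate_driven(Fraction(10, 12), Fraction(4, 7))
```

With balanced classes the three values are 1/3, 1/12 and 7/12. For the classifier of Figure 2 the formula gives 25/147, about 0.1701.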

The previous results are obtained for continuous curves with an infinite number of examples. For
empirical curves with a limited number of examples, the result is not exact, but a good approximation. For
instance, for the example in Figure 2 we have that AUC is 0.83333. The area under the ROC cost curve is
0.1695 for cost proportions, while the theoretical result $2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0$ gives 0.1701. It should
be possible to come up with an exact formula for empirical ROC curves; we leave this as an open problem.

It is interesting to use these general results to get more insight into what the ROC cost curves mean
exactly. For instance, Figure 3 shows the ROC curve and the ROC cost curves for a perfect ranker and a
balanced data set. We used a large number of split points in the ranking to simulate the continuous case.
We see that our new threshold choice method makes optimal choices for c = 0, c = 1/2 and c = 1 but
sub-optimal choices for other operating conditions, which explains the non-zero area under the ROC cost curve
(1/12 in the continuous case). The optimal choice in this case is to ignore the operating condition altogether
and always split the ranking in the middle.

Figure 3: Several graphical representations for a perfect and balanced classifier with 20 positive examples
and 20 negative examples. Left: ROC curve (solid) and convex hull (dashed). Right: cost lines, optimal cost
curve (dashed) and ROC cost curve (thick and solid) against cost proportions.

Figure 4 shows what the ROC cost curve looks like for the worst ranker possible. The lower envelope
of the cost lines shows that in this case the optimal choice is to always predict 0 if c < 1/2 and 1 if c > 1/2,
which results in an expected loss of 1/4. In contrast, our new threshold choice method also takes the
non-optimal split points into account and hence incurs a higher expected loss (7/12 in the continuous case).
Figure 5 shows what the ROC cost curve looks like for a classifier which is alternating (close to the diagonal
in the ROC space) with AUC ≈ 0.5. Here, the expected loss approximates 4/12 = 1/3, while the optimal
choice is the same as in the previous case. It is not hard to prove that in the limiting case n → ∞, the ROC
cost curve for a random classifier is described by the function y = 2c(1 - c), which is the Gini index (the
impurity measure, not to be confused with the Gini coefficient which is 2AUC - 1).
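The limiting curves for the perfect, random and worst rankers can be sketched in closed form. The snippet below assumes balanced classes and the continuous limit; the expressions for F_0 and F_1 as functions of the rate c are worked out here for illustration (they satisfy (F_0 + F_1)/2 = c) and are not taken from the text. The name `roc_cost_curve` is hypothetical.

```python
import numpy as np

def roc_cost_curve(c, f0, f1, pi0=0.5):
    """Loss 2{c*pi0*(1 - F0) + (1 - c)*pi1*F1} at the threshold whose rate equals c.
    f0(c) and f1(c) return F_0 and F_1 at that threshold."""
    pi1 = 1.0 - pi0
    return 2 * (c * pi0 * (1 - f0(c)) + (1 - c) * pi1 * f1(c))

# Assumed closed forms for balanced classes in the continuous limit.
rankers = {
    "perfect": (lambda c: np.minimum(2 * c, 1), lambda c: np.maximum(2 * c - 1, 0)),
    "random": (lambda c: c, lambda c: c),
    "worst": (lambda c: np.maximum(2 * c - 1, 0), lambda c: np.minimum(2 * c, 1)),
}

m = 100_000
c = (np.arange(m) + 0.5) / m              # midpoints of a uniform grid on [0, 1]
curves = {name: roc_cost_curve(c, f0, f1) for name, (f0, f1) in rankers.items()}
areas = {name: curve.mean() for name, curve in curves.items()}   # midpoint rule

for name, area in areas.items():
    print(f"{name}: area = {area:.6f}")
```

Under these assumptions the random ranker's curve coincides with 2c(1 - c), and the three areas approximate 1/12, 1/3 and 7/12.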

6 Evenly-spaced scores. The relation between AUC and the Brier score 

An alternative threshold choice method is to set the threshold on p(1|x) equal to op, where op is the
operating condition. This is a natural criterion, which has been used especially when the classifier is a
probability estimator. Drummond
and Holte l^'] say it is a common example of a "performance independent criterion". Referring to Figure 
22 in their paper which uses this threshold choice they say: "the performance independent criterion, in this 
case, is to set the threshold to correspond to the operating conditions. For example, if PC(+) = 0.2, the
Naive Bayes threshold is set to 0.2". The term PC(+) is equivalent to our 'skew'.

Let us see the definition of this method, which we call probabilistic threshold choice (as presented in [12]).
We first give the formulation which uses cost proportions for operating conditions:

$$T^p_c(c) \triangleq c \qquad (21)$$

We define the same thing for $T^p_z(z)$:

$$T^p_z(z) \triangleq z \qquad (22)$$



Figure 4: Several graphical representations for a very bad and balanced classifier (all 0 are ranked before
the 1) with 20 positive examples and 20 negative examples. Left: ROC curve (solid) and
convex hull (dashed). Right: cost lines, optimal cost curve (dashed) and ROC cost curve (thick and solid)
against cost proportions.

Figure 5: Several graphical representations for an alternating (order is 1,0,1,0,...) and balanced classifier
with 20 positive examples and 20 negative examples. Left: ROC curve (solid) and convex hull (dashed).
Right: cost lines, optimal cost curve (dashed) and ROC cost curve (thick and solid) against cost proportions.

If we plug $T^p_c$ into the general formula of the average expected cost (Eq. 4) we have the expected
probabilistic cost:

$$L^p_c \triangleq \int_0^1 Q_c(T^p_c(c); c)\,w_c(c)\,dc = \int_0^1 Q_c(c; c)\,w_c(c)\,dc \qquad (23)$$

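And if we use the uniform distribution and the definition of $Q_c$ (Eq. 1):

$$
\begin{aligned}
L^p_{U(c)} &= \int_0^1 Q_c(c; c)\,U(c)\,dc \\
&= \int_0^1 2\{c\pi_0(1 - F_0(c)) + (1 - c)\pi_1 F_1(c)\}\,U(c)\,dc \\
&= \int_0^1 2\{c\pi_0(1 - F_0(c))\}\,dc + \int_0^1 2\{(1 - c)\pi_1 F_1(c)\}\,dc \qquad (24)
\end{aligned}
$$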

From here, it is easy to get the following: 
Theorem 5 ([12]). The expected loss using a uniform distribution for cost proportions is the Brier score.
Proof. We have $BS = \pi_0 BS_0 + \pi_1 BS_1$. Using integration by parts, we have

$$
\begin{aligned}
BS_0 &= \int_0^1 s^2 f_0(s)\,ds = \left[s^2 F_0(s)\right]_{s=0}^{1} - \int_0^1 2sF_0(s)\,ds \\
&= 1 - \int_0^1 2sF_0(s)\,ds = \int_0^1 2s\,ds - \int_0^1 2sF_0(s)\,ds
\end{aligned}
$$

Similarly for the negative class:

$$
\begin{aligned}
BS_1 &= \int_0^1 (1 - s)^2 f_1(s)\,ds \\
&= \left[(1 - s)^2 F_1(s)\right]_{s=0}^{1} + \int_0^1 2(1 - s)F_1(s)\,ds \\
&= \int_0^1 2(1 - s)F_1(s)\,ds
\end{aligned}
$$

Taking their weighted average, we obtain

$$BS = \pi_0 BS_0 + \pi_1 BS_1 = \int_0^1 \{\pi_0(2s - 2sF_0(s)) + \pi_1 2(1 - s)F_1(s)\}\,ds$$

which, after reordering of terms and change of variable, is the same expression as Eq. (24). □



In [12] we introduced the Brier curve as a plot of $Q_c(c; c)$ against $c$, so this theorem states that the area under
the Brier curve is the Brier score.
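This identity is easy to check on data. The sketch below is an illustration under stated assumptions: scores are treated as estimates of p(1|x), an example is predicted as class 1 when its score exceeds the threshold, the sample is synthetic, and the area is approximated with a midpoint rule. The function names are hypothetical.

```python
import numpy as np

def brier_score(scores, labels):
    """Mean squared difference between scores (estimates of p(1|x)) and 0/1 labels."""
    return np.mean((scores - labels) ** 2)

def brier_curve(c, scores, labels):
    """Q_c(c; c): loss when the threshold is set equal to the cost proportion c."""
    n = len(scores)
    zeros_as_one = np.sum((labels == 0) & (scores > c)) / n    # pi_0 * (1 - F_0(c))
    ones_as_zero = np.sum((labels == 1) & (scores <= c)) / n   # pi_1 * F_1(c)
    return 2 * (c * zeros_as_one + (1 - c) * ones_as_zero)

def area_under_brier_curve(scores, labels, m=5000):
    """Midpoint-rule approximation of the integral of Q_c(c; c) over c in [0, 1]."""
    grid = (np.arange(m) + 0.5) / m
    return np.mean([brier_curve(c, scores, labels) for c in grid])

# Assumed synthetic sample: uniform scores, labels drawn with probability equal to the score.
rng = np.random.default_rng(1)
scores = rng.uniform(0.0, 1.0, 200)
labels = (rng.uniform(0.0, 1.0, 200) < scores).astype(int)

bs = brier_score(scores, labels)
area = area_under_brier_curve(scores, labels)
print(f"Brier score = {bs:.5f}, area under Brier curve = {area:.5f}")
```

The two numbers should differ only by the discretisation error of the grid, which shrinks as `m` grows.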

Given a classifier with scores, we may use its scores to try to get better threshold choices with this method,
or we may ignore the scores and use evenly-spaced scores. Namely, we can just assign the n scores such
that $s_i = \frac{i-1}{n-1}$, going then from 0 to 1 with steps of $\frac{1}{n-1}$. With this simple idea we see that the probabilistic
threshold choice method reduces to $T^n$, which was analysed in the previous two sections. And now we get
a very interesting result.

Corollary 6. If scores are evenly spaced, we get that:

$$2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0 = BS \qquad (25)$$

Figure 6: Several graphical representations for a ranker with evenly spaced scores 1 0.957 0.913 0.870 0.826
0.782 0.739 0.696 0.652 0.609 0.565 0.522 0.478 0.435 0.391 0.348 0.304 0.261 0.217 0.174 0.130 0.087
0.043 0 and true classes 1,1,1,1,0,1,1,1,0,1,1,1,1,0,1,0,1,1,0,0,0,0,1,0 (15 negative examples and 9 positive
examples). Left: ROC curve (solid) and convex hull (dashed). Right: cost lines, optimal cost curve (dashed),
ROC cost curve (brown, thick and solid) and Brier curve (pink, thin and solid) against cost proportions.

As far as we are aware, this is the first published connection between the area under the ROC curve and
the Brier score. Of course, this is related to the Brier curves introduced in [12], so we can also say that
Brier curves and ROC cost curves are closely related (have the same area) if the classifier has evenly-spaced
scores.

Figure 6 shows a classifier with evenly-spaced scores, so that the previous corollary holds. We can see
that the Brier curve and the ROC cost curve have similar shapes, although they are not identical. We have
that AUC = 0.7777 and $2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0 = 0.203125$, whereas the Brier score is 0.2047101 and
the area under the Brier curve is 0.2006.
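The reported figures can be recomputed with exact fractions. In the sketch below, the list of 24 true classes and the evenly spaced scores (from 1 down to 0 in steps of 1/23) are a reading of the Figure 6 caption and should be treated as an assumption; class 1 is taken to be the class that should receive high scores.

```python
from fractions import Fraction

# Assumed reading of the Figure 6 caption: classes listed from highest to lowest score.
classes = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
n = len(classes)
scores = [Fraction(n - 1 - i, n - 1) for i in range(n)]   # 1, 22/23, ..., 1/23, 0

n1 = sum(classes)
n0 = n - n1
pi0, pi1 = Fraction(n0, n), Fraction(n1, n)

# AUC: share of (class-1, class-0) pairs in which the class-1 example scores higher.
pairs = sum(1 for s1, y1 in zip(scores, classes) if y1 == 1
            for s0, y0 in zip(scores, classes) if y0 == 0 and s1 > s0)
auc = Fraction(pairs, n0 * n1)

formula = 2 * pi1 * pi0 * (1 - auc) + Fraction(1, 3) - pi1 * pi0
brier = sum((s - y) ** 2 for s, y in zip(scores, classes)) / n

print(f"AUC = {float(auc):.4f}")
print(f"2*pi1*pi0*(1 - AUC) + 1/3 - pi1*pi0 = {float(formula):.6f}")
print(f"Brier score = {float(brier):.7f}")
```

Under this reading the script reproduces the AUC, the value of the formula and the Brier score quoted in the text, with the remaining gap between the last two reflecting the finite number of examples.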

Finally, we show a perfectly calibrated classifier and the ROC cost curves with the Brier curves in Figure
7. The pink curve (Brier curve) for cost proportions matches the black curve (the optimal curve). The ROC
cost curve shows that the rate-driven threshold choice method sometimes makes sub-optimal choices: for
example, it only switches to the second point from the left in the ROC curve when c = 4/11 ≈ 0.36, whereas
the optimal decision would be to switch to this point from c = 0.25.
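The two switching points can be verified directly from the scores and classes listed in the caption of Figure 7. This is a small sketch under the assumption that examples with score at most t are predicted as class 0 (the positive class in that caption); the helper names `roc_point` and `loss` are hypothetical.

```python
from fractions import Fraction

# Scores and true classes as listed in the caption of Figure 7.
scores = [Fraction(1)] + [Fraction(5, 6)] * 6 + [Fraction(1, 4)] * 4
classes = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
n = len(classes)
n0 = classes.count(0)
n1 = n - n0
pi0, pi1 = Fraction(n0, n), Fraction(n1, n)

# Perfect calibration: within each group of equal scores, the share of class 1 equals the score.
calibrated = all(
    Fraction(sum(y for s, y in zip(scores, classes) if s == v),
             sum(1 for s in scores if s == v)) == v
    for v in set(scores)
)

def roc_point(t):
    """(F0, F1) when examples with score <= t are predicted as class 0."""
    f0 = Fraction(sum(1 for s, y in zip(scores, classes) if y == 0 and s <= t), n0)
    f1 = Fraction(sum(1 for s, y in zip(scores, classes) if y == 1 and s <= t), n1)
    return f0, f1

def loss(c, f0, f1):
    """Cost line of a fixed ROC point: 2{c*pi0*(1 - F0) + (1 - c)*pi1*F1}."""
    return 2 * (c * pi0 * (1 - f0) + (1 - c) * pi1 * f1)

f0, f1 = roc_point(Fraction(1, 4))        # second ROC point from the left
rate = pi0 * f0 + pi1 * f1                # rate-driven choice reaches this point at c = rate
crossing = Fraction(1, 4)                 # candidate crossing with the all-class-1 cost line
same_cost = loss(crossing, 0, 0) == loss(crossing, f0, f1)
print(calibrated, f0, f1, rate, same_cost)
```

The rate at the second ROC point is where the rate-driven method switches, while the crossing of the two cost lines marks where switching becomes optimal.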

Figure 7: Several graphical representations for a perfectly calibrated classifier with scores
(1, 0.833333, 0.833333, 0.833333, 0.833333, 0.833333, 0.833333, 0.25, 0.25, 0.25, 0.25) and true classes
(1,1,1,1,1,1,0,0,0,0,1) (4 positive examples and 7 negative examples). Left: convex ROC curve. Right:
cost lines, optimal cost curve (dashed), ROC cost curve (brown, thick and solid) and Brier curve (pink, thin
and solid) against cost proportions.

Footnote: Working with skews instead of cost proportions, the derivation should lead to an equation
corresponding to Corollary 4, i.e. one relating $\frac{1}{2}(1 - AUC) + \frac{1}{12}$ to the Brier score. This exercise is left
to the reader.

Footnote: The Brier score and the area under the Brier curve reported for Figure 6 should be exactly equal,
but some small problems when dealing with ties in the implementation of the curves are causing this small
difference.

7 Conclusions

The definition of cost curve in the literature has been partially elusive. While it is clear what cost lines are, it
was not clear what different options we may have to draw different curves on the cost space, which of them
were valid and which were not, and, more importantly, whether they correspond to some curves or
representations in ROC space.

In this paper, we have clarified the relation between ROC space and cost space, by finding the corresponding
curves for ROC curves in cost space. These represent cost curves for rankers that do not commit to
a fixed decision threshold. Cost plots have some advantages over ROC plots, and the possibility of drawing
ROC cost curves may give further support for using cost plots and drawing ROC cost curves in them.

In addition, we have shown that when the scores of a classifier are set in an evenly-spaced way, the 
ROC cost curves correspond to the previously presented Brier curves and we have the first firm connection 
between the Brier Score and AUC. This also suggests that there might be a way to draw Brier curves in ROC 
space. 

Given the exploratory character of this paper, there are many interesting options to follow up. Our focus
will be on how to use ROC cost curves to choose among models and construct hybrid classifiers.